We can now have a list of genes ordered according to their evidence
for being differentially-expressed. You should have saved a
results_TGF_vs_CTR_annotated.rds object in the previous
session.
library(DESeq2)
results_tgf <- readRDS("Robjects/results_TGF_vs_CTR_annotated.rds")
If you did not save the results from comparing TGF to
CTR, you can use an object from the course materials
library(DESeq2)
results_tgf <- readRDS("Robjects/results_TGF_vs_CTR_annotated_BACKUP.rds")
Now we have annotated our results, we can start to explore some common visualisation techniques. In the process we will hope to gain more insights into our results
A common plot for displaying the results of a differential expression analysis is a volcano plot. It is a scatter plot that shows statistical significance and the magnitude of difference between conditions. They are used to identify which genes are the most significant and are also changing by the most amount.
The data we need for the plot is contained in our
results_tgf data frame. This basic plot displays a point
for every gene, but does not take advantage of some of the other columns
in the data frame.
library(ggplot2)
results_tgf %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj))) + geom_point()
One modification is to colour the points according to whether each gene is significant in the analysis. The indicator of significance can be a new column in the data frame that we create on-the-fly using the pipe operator.
results_tgf %>%
mutate(Significant = padj < 0.05 & abs(log2FoldChange) > 2) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj), col=Significant)) + geom_point()
We can also add the gene names to the plot. This should be
straightforward as ggplot2 has a label aesthetic that can
be mapped to columns in a data frame. The geom_text plot
will then display the labels. However, the following plot is a bit
crowded.
## Not a good idea to run this!!
results_tgf %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj), label=SYMBOL)) + geom_point() + geom_text()
The problem here is that ggplot2 is trying to label every point with
a name; not quite what we want. The trick is to create a label that is
blank for most genes and only labels the points we are interested in.
The ifelse function in R is a convenient way to set the
entries in a vector based on a logical expression. In this
case, make the values in Label the same as the gene symbol
if the gene is in our list of “top genes”. Otherwise, points get labeled
with a blank string "".
For clarity, we also make the points slightly transparent and use a different colour for the text.
N <- 10
top_genes <- results_tgf$ENSEMBL[1:N]
results_tgf %>%
mutate(Label = ifelse(ENSEMBL %in% top_genes, SYMBOL, "")) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj), label=Label)) + geom_point(alpha=0.4) + geom_text(col="blue")
Finally, a slightly better positioning of text is given by the
ggrepel package.
if(!require(ggrepel)) install.packages("ggrepel")
results_tgf %>%
mutate(Label = ifelse(ENSEMBL %in% top_genes, SYMBOL, "")) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj), label=Label)) + geom_point(alpha=0.4) + geom_text_repel(col="blue")
The labeling of genes on the plot is not restricted to the most significant genes. We could use a similar approach to label any set of genes that we are interested in.
For instance, we could hypothesise that Extra-Cellular Matrix (ECM) genes play are important for the transformation in cells treated with TGF. We could use the volcano plot to demonstrate whether genes belonging to this pathway show evidence for being differentially-expressed
ENSEMBL IDs using
the AnnotationDbi::select functionYou may have already seen the use of a heatmap as a quality
assessment tool to visualise the relationship between samples in an
experiment. Another common use-case for such a plot is to visualise the
results of a differential expression analysis. Although
ggplot2 has a geom_tile function to make
heatmaps, specialised packages such as pheatmaps offer more
functionality such as clustering the samples.
The counts we are visualising are the variance-stablised counts, which are more appropriate for visualisation.
Here we will take the top 10 genes from the differential expression
analysis and produce a heatmap with the pheatmap package.
We can take advantage of the fact the our counts table contains Ensembl
gene names in the rows. Standard subset operations in R can then be
used.
The default colour palette goes from low expression in blue to high expression in red, which is a good alternative to the traditional red/green heatmaps which are not suitable for those with forms of colour-blindness.
dds <- readRDS("Robjects/dds_BACKUP.rds")
# pheatmap is a specialised package to make heatmaps
library(pheatmap)
top_genes <- results_tgf$ENSEMBL[1:10]
vsd <- vst(dds)
using 'avgTxLength' from assays(dds), correcting for library size
# top_genes is a vector containing ENSEMBL names of the genes we want to see in the heatmap
pheatmap(assay(vsd)[top_genes,])
The heatmap is more informative if we add colours underneath the
sample dendrogram to indicate which sample group each sample belongs to.
This we can do by creating a data frame containing metadata for each of
the samples in our dataset. With the DESeq2 workflow we
have already created such a data frame. We have to make sure the the
rownames of the data frame are the same as the column names of the
counts matrix.
sampleInfo <- as.data.frame(colData(dds)[,c("condition","Treated")])
pheatmap(assay(vsd)[top_genes,],
annotation_col = sampleInfo,
scale="row")
Any plot we create in RStudio can be saved as a png or pdf file. We
use the png or pdf function to create a file
for the plot to be saved into and run the rest of the code as normal.
The plot does not get displayed in RStudio, but printed to the specified
file.
png("heatmap_top10_genes.png",width=800,height=800)
pheatmap(assay(vsd)[top_genes,],
annotation_col = sampleInfo)
# dev.off()
There are many arguments to explore in pheatmap. For
example, we might want to use a specific order to the rows and columns
rather than using clustering. A useful option is to specific our own
labels for the rows (genes). The default is to use the rownames of the
count matrix. In our cases these are Ensembl IDs and not easy to
interpret.
gene_labels <- results_tgf$SYMBOL[1:N]
pheatmap(assay(vsd)[top_genes,],
annotation_col = sampleInfo,
labels_row = gene_labels,
scale="row")
Given the nature of how the genes were selected for the heatmap, we shouldn’t be surprised by the good separation that it demonstrates.
Instead, an un-biased approach can be used where we choose genes
based on their variability across the whole dataset. The variability of
a gene can be calculated using the convenient rowSds
function applied to the variance-stabilised counts in the
vsd object. However, some genes have very low expression
and were not tested by DESeq2. We can exclude these from
the visualisation
res_with_var <- mutate(results_tgf, GeneVar = rowSds(assay(vsd))) %>%
filter(!is.na(padj))
res_with_var from above by
decreasing GeneVarENSEMBL IDs for the first 100 rows. The
slice function from dplyr might be useful
heregenes_to_plot is the ENSEMBL IDs that you identifypheatmap(assay(vsd)[genes_to_plot,],
annotation_col = sampleInfo,
scale="row")
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.
There are two different approaches one might use, and we will cover the theory behind both.
For a particular pathway we need to calculate how many genes were identified as differentially-expressed and compare to how many we would be expect by chance. Or in other words, if we repeatedly generated a list of differentially-expressed genes at random how many genes from this pathway would be expect to see.
For the ECM pathway we can extract all genes as follows:-
## The pull function from dplyr is used to extract a particular column
library(org.Hs.eg.db)
pathway_genes <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0030198",
keytype = "GO",
columns="ENSEMBL") %>% pull(ENSEMBL)
'select()' returned 1:many mapping
between keys and columns
We can then annotate each gene in our results according to whether it belongs to this pathway, and whether it is differentially-expressed.
go_table <- mutate(results_tgf,
inPathway = ENSEMBL %in% pathway_genes,
isDE = padj < 0.05 & abs(log2FoldChange) > 1)
go_table
Cross-tabulating the two new columns gives a basis for a statistical test
table(go_table$inPathway, go_table$isDE)
FALSE TRUE
FALSE 31031 547
TRUE 105 24
The Fisher’s exact test or chi-squared test (as seen here) can then be used
chisq.test(table(go_table$inPathway, go_table$isDE))
Pearson's Chi-squared test with
Yates' continuity correction
data: table(go_table$inPathway, go_table$isDE)
X-squared = 197.39, df = 1, p-value
< 2.2e-16
In reality it would be impractical to test all possible pathways in this manner, so there are a number of Bioconductor packages that automate the process
clusterProfiler is a Bioconductor package for
over-representation analysis. It’s main advantage is that it provides
some nice visualisation methods.
The main function is enrichGO which requires the IDs of
genes found to be differentially-expressed (sigGenes) and
the IDs of all genes in the dataset (universe). It
uses the org.Hs.eg.db package to map between gene names and
biological pathways.
library(clusterProfiler)
universe <- results_tgf %>% pull(ENSEMBL)
sigGenes <- results_tgf %>%
filter(padj < 0.05, !is.na(ENSEMBL)) %>% pull(ENSEMBL)
enrich_go <- enrichGO(
gene= sigGenes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP",
universe = universe,
qvalueCutoff = 0.05,
readable=TRUE
)
The result of enrichGo can be turned into a data frame
for easier interpretation.
enrich_go %>% data.frame
A dot plot can show us the most enriched pathways, and the size of each.
dotplot(enrich_go,showCategory=20)
Relationships between the identified categories can be found using
emapplot.
enrich_go <- enrichplot::pairwise_termsim(enrich_go)
emapplot(enrich_go)
Overlaps between gene sets can also be visualised using an “Upset plot” - an alternative to a venn diagram.
enrichplot::upsetplot(enrich_go)
An appealing feature of the GSEA method is that it
does not require us to impose arbitrary cut-offs on the dataset to
decide what is differentially-expressed or not. The steps in producing
the input required for GSEA are i) retrieving the ranked statistics ii)
naming each one according to a chosen identifier (ENSEMBL
or ENTREZID for example).
The clusterProfiler package also includes an
implementation of the GSEA algorithm, and the function works in much the
same way as enrichGO from above.
ranked_genes <- results_tgf %>%
arrange(desc(stat)) %>%
filter(!is.na(stat))
geneList <- pull(ranked_genes, stat)
names(geneList) <- pull(ranked_genes, ENSEMBL)
gse_GO <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "BP",keyType = "ENSEMBL")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
gse_GO %>% as.data.frame
An overview of the results can be provided by a “ridge plot”. This allows comparison of the test statistics for each of the top enriched pathways.
ridgeplot(gse_GO)
Picking joint bandwidth of 0.82
An upset plot can still be produced, but this time the distribution of statistics for overlapping categories can be produced.
enrichplot::upsetplot(gse_GO)
The results confirm that the ECM pathway has many differentially-expressed genes (more than we would expect by chance). Moreover, there is a tendancy for these genes to be up-regulated; as indicated by the high positive enrichment score. Another way to visualise the GSEA results, that is typically produced from the GSEA java app, is the so-called enrichment plot.
gseaplot(gse_GO,geneSetID = "GO:0030198")
The enrichment plot for a gene set with a high negative enrichment score reveals a different pattern.
gseaplot(gse_GO,geneSetID = "GO:0002283")
clusterProfiler can
also find enriched KEGG terms
using the enrichKEGG function. There are a couple of
changes that are required from enrichGO
ENTREZID have to be used as the identifer typehsa.enrichKEGG function to identify enriched KEGG
terms in the analysis.The Bioconductor package have the convenience of being able to make
queries offline. However, they are only available for certain organisms.
If your organism does not have an org.XX.eg.db package
listed on the Bioconductor annotation page (http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData),
an alternative is to use biomaRt which provides an interface to the
popular biomart annotation resource.
The first step is to find the name of a database that you want to connect to.
library(biomaRt)
listMarts()
ensembl=useMart("ENSEMBL_MART_ENSEMBL")
# list the available datasets (species). Replace human with the name of your organism
listDatasets(ensembl) %>% filter(grepl("Human",description))
ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)
Queries to biomaRt are constructed in a similar way to
the queries we performed with the org.Hs.eg.db package.
Instead of keys we have filters, and instead
of columns we have attributes. The list of acceptable
values is much more comprehensive that for the org.Hs.eg.db
package.
listFilters(ensembl) %>%
filter(grepl("ensembl",name))
listAttributes(ensembl) %>%
filter(grepl("gene",name))
An advantage over the org.. packages is that positional
information can be retrieved
attributeNames <- c('ensembl_gene_id', 'entrezgene_id', 'external_gene_name', "chromosome_name","start_position","end_position")
getBM(attributes = attributeNames,
filters = "ensembl_gene_id",
values=top_genes,
mart=ensembl)